{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Quantitative Finance\n",
    "\n",
    "Copyright (c) 2019 Python Charmers Pty Ltd, Australia, <https://pythoncharmers.com>. All rights reserved.\n",
    "\n",
    "<img src=\"img/python_charmers_logo.png\" width=\"300\" alt=\"Python Charmers Logo\">\n",
    "\n",
    "Published under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. See `LICENSE.md` for details.\n",
    "\n",
    "Sponsored by Tibra Global Services, <https://tibra.com>\n",
    "\n",
    "<img src=\"img/tibra_logo.png\" width=\"300\" alt=\"Tibra Logo\">\n",
    "\n",
    "\n",
    "## Module 2.3: Testing and Benchmarking\n",
    "\n",
    "### 2.3.2 Confidence Intervals\n",
    "\n",
    "A statistical analysis without discussion of confidence or range is not complete.\n",
    "\n",
    "Statistics is about dealing with uncertainty, and when a statistical analysis gives a confident \"the mean is 3.2\" result, there is information missing here, specifically around how confident we are in that result and where we can reasonably expect the *actual* value to end up. This is also why political polls jump around so much in the news - they don't really, just that newspapers rarely report confidence intervals, so when sample mean naturally jumps around, this is the only value that is reported.\n",
    "\n",
    "Confidence intervals are a key measure to use here, and one of the easiest to explain, especially to non-statistical stakeholders. A confidence interval for a given estimate, at a given threshold X% is an interval for where X% of the expected values sit in that interval. Let's look at an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%run setup.ipy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Module from 1.3.2 - Multivariate OLS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import quandl\n",
    "\n",
    "interest_rates = quandl.get(\"RBA/F13_FOOIRATCR\")\n",
    "interest_rates = interest_rates[interest_rates.columns[0]]  # Extract the first column, whatever it is called\n",
    "interest_rates.name = \"InterestRate\"  # Rename, as the original had a long name. Hint: don't use spaces or special chars"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "inflation = quandl.get(\"RBA/G01_GCPIAGSAQP\")\n",
    "inflation.columns = ['Inflation']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "au_dollar = quandl.get(\"BUNDESBANK/BBEX3_M_AUD_USD_CM_AC_A01\")['Value']\n",
    "au_dollar.name = \"AUDUSD\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.concat([interest_rates, inflation, au_dollar], axis=1)  # Combines multiple series into a DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import statsmodels.formula.api as smf\n",
    "est = smf.ols(formula='Inflation ~ InterestRate + AUDUSD', data=data).fit()  # Does the constant for us"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>        <td>Inflation</td>    <th>  R-squared:         </th> <td>   0.102</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.087</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   6.929</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Fri, 09 Feb 2024</td> <th>  Prob (F-statistic):</th>  <td>0.00141</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>01:28:37</td>     <th>  Log-Likelihood:    </th> <td> -100.11</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>   125</td>      <th>  AIC:               </th> <td>   206.2</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>   122</td>      <th>  BIC:               </th> <td>   214.7</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     2</td>      <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "        <td></td>          <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th>    <td>    0.2446</td> <td>    0.308</td> <td>    0.794</td> <td> 0.429</td> <td>   -0.365</td> <td>    0.855</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>InterestRate</th> <td>    0.0646</td> <td>    0.017</td> <td>    3.720</td> <td> 0.000</td> <td>    0.030</td> <td>    0.099</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>AUDUSD</th>       <td>    0.0730</td> <td>    0.385</td> <td>    0.190</td> <td> 0.850</td> <td>   -0.688</td> <td>    0.834</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td>49.562</td> <th>  Durbin-Watson:     </th> <td>   1.909</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.000</td> <th>  Jarque-Bera (JB):  </th> <td> 455.460</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td> 0.997</td> <th>  Prob(JB):          </th> <td>1.25e-99</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td>12.136</td> <th>  Cond. No.          </th> <td>    56.6</td>\n",
       "</tr>\n",
       "</table><br/><br/>Notes:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:              Inflation   R-squared:                       0.102\n",
       "Model:                            OLS   Adj. R-squared:                  0.087\n",
       "Method:                 Least Squares   F-statistic:                     6.929\n",
       "Date:                Fri, 09 Feb 2024   Prob (F-statistic):            0.00141\n",
       "Time:                        01:28:37   Log-Likelihood:                -100.11\n",
       "No. Observations:                 125   AIC:                             206.2\n",
       "Df Residuals:                     122   BIC:                             214.7\n",
       "Df Model:                           2                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "================================================================================\n",
       "                   coef    std err          t      P>|t|      [0.025      0.975]\n",
       "--------------------------------------------------------------------------------\n",
       "Intercept        0.2446      0.308      0.794      0.429      -0.365       0.855\n",
       "InterestRate     0.0646      0.017      3.720      0.000       0.030       0.099\n",
       "AUDUSD           0.0730      0.385      0.190      0.850      -0.688       0.834\n",
       "==============================================================================\n",
       "Omnibus:                       49.562   Durbin-Watson:                   1.909\n",
       "Prob(Omnibus):                  0.000   Jarque-Bera (JB):              455.460\n",
       "Skew:                           0.997   Prob(JB):                     1.25e-99\n",
       "Kurtosis:                      12.136   Cond. No.                         56.6\n",
       "==============================================================================\n",
       "\n",
       "Notes:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "\"\"\""
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "est.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we let our `InterestRate` variable be $X_1$, then we can see that our model gives a 95% confidence interval between a low of 0.030 and a high of 0.102 (your results may change, see this line):\n",
    "\n",
    "<img src=\"img/confidence_interval_highlight.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A key part of the need for a confidence interval is that we almost always have error in our estimate, at the very least due to the sample we take. This is true even for things that seem ground truth. For instance, if we are predicting stock prices based on the close price, we have a sampling error - if the market closed a second earlier, we may get a different close price.\n",
    "\n",
    "Confidence intervals can be calculated lots of different ways, and the general process is the same whether your process is a classical statistical one, a Bayesian or a Simulation methodology.\n",
    "\n",
    "* For classical statistics, many distribution types have a method to compute the confidence interval, based on manipulation of the equations of those distributions. See `scipy.stats.norm.interval` for information on how to do this for a normal distribution. One example that most have heard about is \"for a normal distribution, 95% of values fit within two standard errors\".\n",
    "* For Bayesian statistics and simulations, the formulas and simulations create confidence intervals through their varied predictions. Create many bootstrap samples (same size as the original sample, however sampled with replacement) and compute the statistic for each bootstrap sample. After sorting, take the value 2.5% of the way through the data, and 97.5% of the way through. This range is the 95% confidence interval. This process is generally known as the bootstrap method.\n",
    "\n",
    "As noted in the previous notebook, the value of 95 in \"95% confidence interval\" has literally no special meaning - it is just a value many people choose. Don't use this value without some thought about it, especially if you are making decisions related to this value."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common usage of confidence intervals is to see if a given value sits within it, provide a pseudo-likelihood that value is \"possible\". For instance, if the 95% confidence interval for the slope of a line contains 0, some would say there is a possibility of \"no correlation\" between the two. As noted in the last notebook, this confuses the term and is not a reliable methodology to use.\n",
    "\n",
    "Like p-values, confidence intervals are often misinterpreted. If you get a confidence interval of $a$ and $b$, this does **not** mean that 95% of individual measurements will fit between $a$ and $b$. It means that when we take samples (of the same size we used to compute the confidence interval), 95% of the values for the calculated statistic (such as the mean), will fit inside those bounds.\n",
    "\n",
    "As a note on reporting, you generally shouldn't give confidence intervals to too many decimal places. Saying \"the confidence interval for average height is between 162 and 182cm\" is better than saying \"the confidence interval is between 162.243cm and 182.976cm\", because the latter makes it seem like the process is much more formal than it really is. Remember you likely just used a 95% confidence interval because that's what most people use.\n",
    "\n",
    "Using the bootstrap method, we are not limited to computing the confidence interval on the mean, as we are with more classical statistics (well, not limited, but it is very hard to do much else). The resampling method used in bootstrap statistics allows for us to calculate arbitrary statistics from the dataset.\n",
    "\n",
    "Confidence intervals are affected by two main factors:\n",
    "\n",
    "* Sample size. Larger sample sizes lead to lower confidence intervals, due to the fact that the sample is \"more like\" the population, by virtue of having more of the population in it. Therefore, all samples are \"more like\" each other, and our interval will be smaller.\n",
    "* Variation within the population. Confidence intervals are wider when the variation in the population itself is wider. This makes samples \"less like\" each other, leading to greater different values in different samples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "1. Plot the interest rates data above using `altair`, as a line plot.\n",
    "2. Add error bars to your plot. See the Altair gallery for examples on how to do this.\n",
    "3. Fit a normal distribution to the means-of-samples of the interest rate data (i.e. sample the data, compute the mean, repeat many times). Compute the confidence interval using the `scipy.stats.norm.interval` method.\n",
    "4. Use a bootstrap method, where you sort all the sample means and take the 95% confidence interval as the \"middle 95%\" noted above.\n",
    "5. Compare the results from (3) and (4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the solutions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<style>\n",
       "  #altair-viz-f250115c270447ee8a0fad3c926da253.vega-embed {\n",
       "    width: 100%;\n",
       "    display: flex;\n",
       "  }\n",
       "\n",
       "  #altair-viz-f250115c270447ee8a0fad3c926da253.vega-embed details,\n",
       "  #altair-viz-f250115c270447ee8a0fad3c926da253.vega-embed details summary {\n",
       "    position: relative;\n",
       "  }\n",
       "</style>\n",
       "<div id=\"altair-viz-f250115c270447ee8a0fad3c926da253\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-f250115c270447ee8a0fad3c926da253\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-f250115c270447ee8a0fad3c926da253\");\n",
       "    }\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm/vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm/vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm/vega-lite@5.16.3?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm/vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function maybeLoadScript(lib, version) {\n",
       "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
       "      return (VEGA_DEBUG[key] == version) ?\n",
       "        Promise.resolve(paths[lib]) :\n",
       "        new Promise(function(resolve, reject) {\n",
       "          var s = document.createElement('script');\n",
       "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "          s.async = true;\n",
       "          s.onload = () => {\n",
       "            VEGA_DEBUG[key] = version;\n",
       "            return resolve(paths[lib]);\n",
       "          };\n",
       "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "          s.src = paths[lib];\n",
       "        });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else {\n",
       "      maybeLoadScript(\"vega\", \"5\")\n",
       "        .then(() => maybeLoadScript(\"vega-lite\", \"5.16.3\"))\n",
       "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 300, \"continuousHeight\": 300}}, \"layer\": [{\"mark\": {\"type\": \"line\"}, \"encoding\": {\"x\": {\"field\": \"Date\", \"type\": \"temporal\"}, \"y\": {\"field\": \"InterestRate\", \"type\": \"quantitative\"}}, \"transform\": [{\"calculate\": \"datum.InterestRate-datum.std\", \"as\": \"ymin\"}, {\"calculate\": \"datum.InterestRate+datum.std\", \"as\": \"ymax\"}]}, {\"mark\": {\"type\": \"errorbar\"}, \"encoding\": {\"x\": {\"field\": \"Date\", \"type\": \"temporal\"}, \"y\": {\"field\": \"ymin\", \"type\": \"quantitative\"}, \"y2\": {\"field\": \"ymax\"}}, \"transform\": [{\"calculate\": \"datum.InterestRate-datum.std\", \"as\": \"ymin\"}, {\"calculate\": \"datum.InterestRate+datum.std\", \"as\": \"ymax\"}]}], \"data\": {\"url\": \"altair-data-cdaa144340b5d81477c4e61852014b5a.json\", \"format\": {\"type\": \"json\"}}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.16.3.json\"}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "alt.LayerChart(...)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 1 & 2\n",
    "d = data.reset_index().dropna()\n",
    "d[\"std\"] = np.std(d[\"InterestRate\"]) # Add a column for std\n",
    "\n",
    "import altair as alt\n",
    "\n",
    "# the base chart\n",
    "base = alt.Chart(d).transform_calculate(\n",
    "    ymin=\"datum.InterestRate-datum.std\",\n",
    "    ymax=\"datum.InterestRate+datum.std\"\n",
    ")\n",
    "# Data Line\n",
    "line = base.mark_line().encode(x=\"Date\", y=\"InterestRate\")\n",
    "# Error Bars\n",
    "bar = base.mark_errorbar().encode(x=\"Date\", y=\"ymin:Q\", y2=\"ymax:Q\")\n",
    "line + bar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4.127339872446008, 5.456089460887325)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 3\n",
    "from math import sqrt\n",
    "pool = data[\"InterestRate\"].dropna()\n",
    "\n",
    "sample_size = int(len(pool)/5)\n",
    "iters = range(1000)\n",
    "means = []\n",
    "\n",
    "for i in iters:\n",
    "    sample = np.random.choice(pool, sample_size)\n",
    "    means.append(np.mean(sample))\n",
    "\n",
    "mu = np.mean(means)\n",
    "std = np.std(means)\n",
    "stats.norm.interval(0.95, loc=mu, scale=std)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4.153999999999999, 5.554666666666667)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 4 \n",
    "pool = data[\"InterestRate\"].dropna()\n",
    "\n",
    "sample_size = int(len(pool)/5)\n",
    "iters = range(1000)\n",
    "means = []\n",
    "\n",
    "for i in iters:\n",
    "    sample = np.random.choice(pool, sample_size)\n",
    "    means.append(np.mean(sample))\n",
    "\n",
    "idx_lower = int(len(means)*.025)\n",
    "idx_upper = int(len(means)*.975)\n",
    "sorted_means = np.sort(means)\n",
    "lower = sorted_means[idx_lower]\n",
    "upper = sorted_means[idx_upper]\n",
    "lower, upper"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Very similar results, as expected from 1000 samples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*For solutions, see `solutions/confidence_intervals.py`*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Worked Example\n",
    "\n",
    "In this worked example, we will compute the 90% confidence interval for the proportion of times the following is true:\n",
    "\n",
    "    If the price of IBM increases on a given day, Microsoft will increase the following day.\n",
    "\n",
    "To do this, we first get our data, and then take a sample. We'll use daily closing prices to determine \"increase\". \n",
    "\n",
    "Note also we aren't testing a correlation. We don't care so much about \"if IBM drops, will Microsoft drop?\", just that if IBM increases, Microsoft will)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "ibm = quandl.get(\"WIKI/IBM\")['Close']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "msft = quandl.get(\"WIKI/MSFT\")['Close']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Date\n",
       "1986-03-13    28.00\n",
       "1986-03-14    29.00\n",
       "1986-03-17    29.50\n",
       "1986-03-18    28.75\n",
       "1986-03-19    28.25\n",
       "Name: Close, dtype: float64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "msft.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Combine to make analysis easier\n",
    "stocks = pd.DataFrame({\"ibm\": ibm, \"msft\": msft})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we compute the two intermediate pieces of information:\n",
    "\n",
    "1. Did IBM increase on the day?\n",
    "2. Did MSFT increase on the day?\n",
    "\n",
    "We can then offset (2) to be \"Did MSFT increase the day after?\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "stocks['ibm_up'] = stocks['ibm'].diff() > 0\n",
    "stocks['msft_up'] = stocks['msft'].diff() > 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "stocks.dropna(inplace=True, how='any')  # Removes rows missing some data. Effectively starts from MSFT IPO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ibm</th>\n",
       "      <th>msft</th>\n",
       "      <th>ibm_up</th>\n",
       "      <th>msft_up</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Date</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1986-03-13</th>\n",
       "      <td>150.50</td>\n",
       "      <td>28.00</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1986-03-14</th>\n",
       "      <td>150.38</td>\n",
       "      <td>29.00</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1986-03-17</th>\n",
       "      <td>150.88</td>\n",
       "      <td>29.50</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1986-03-18</th>\n",
       "      <td>152.38</td>\n",
       "      <td>28.75</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1986-03-19</th>\n",
       "      <td>151.63</td>\n",
       "      <td>28.25</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               ibm   msft  ibm_up  msft_up\n",
       "Date                                      \n",
       "1986-03-13  150.50  28.00    True    False\n",
       "1986-03-14  150.38  29.00   False     True\n",
       "1986-03-17  150.88  29.50    True     True\n",
       "1986-03-18  152.38  28.75    True    False\n",
       "1986-03-19  151.63  28.25   False    False"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stocks.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Offset msft_up to be \"will it increase tomorrow?\"\n",
    "stocks['msft_up_tomorrow'] = stocks['msft_up'].shift(-1)  # -1 \"shifts upwards\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ibm</th>\n",
       "      <th>msft</th>\n",
       "      <th>ibm_up</th>\n",
       "      <th>msft_up</th>\n",
       "      <th>msft_up_tomorrow</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Date</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1986-03-13</th>\n",
       "      <td>150.50</td>\n",
       "      <td>28.00</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1986-03-14</th>\n",
       "      <td>150.38</td>\n",
       "      <td>29.00</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1986-03-17</th>\n",
       "      <td>150.88</td>\n",
       "      <td>29.50</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1986-03-18</th>\n",
       "      <td>152.38</td>\n",
       "      <td>28.75</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1986-03-19</th>\n",
       "      <td>151.63</td>\n",
       "      <td>28.25</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               ibm   msft  ibm_up  msft_up msft_up_tomorrow\n",
       "Date                                                       \n",
       "1986-03-13  150.50  28.00    True    False             True\n",
       "1986-03-14  150.38  29.00   False     True             True\n",
       "1986-03-17  150.88  29.50    True     True            False\n",
       "1986-03-18  152.38  28.75    True    False            False\n",
       "1986-03-19  151.63  28.25   False    False            False"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stocks.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract just the values for which our premise is true:\n",
    "premise_true = stocks[stocks['ibm_up']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.49712858926342074"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get an overall estimate for our conclusion being true as well:\n",
    "premise_true['msft_up_tomorrow'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, our estimate is 49% that if IBM increased today, MSFT will increase tomorrow. Effectively random.\n",
    "\n",
    "Let's compute the confidence interval for this using the bootstrap method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_msft_follows_ibm_statistic(original_data):\n",
    "    # Encapsulates code above for \"MSFT increase follows an IBM increase\"\n",
    "    sample = original_data.sample(replace=True, n=len(original_data))\n",
    "    return sample['msft_up_tomorrow'].mean()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.4968789013732834"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compute_msft_follows_ibm_statistic(premise_true)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "number_experiments = 10000\n",
    "\n",
    "values = np.array([compute_msft_follows_ibm_statistic(premise_true) for i in range(number_experiments)])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_confidence_interval(values, ci=0.95):\n",
    "    \"\"\"Computer confidence interval for the given values\"\"\"\n",
    "    assert 0 < ci <= 1\n",
    "    n = len(values)\n",
    "    lower = int(n * (1-ci)/2)\n",
    "    upper = int(n * ci / 2)\n",
    "    assert upper > lower  # Can be lower == upper if not enough samples\n",
    "    sorted_values = np.sort(values)\n",
    "    return sorted_values[lower], sorted_values[upper]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.48414481897627965, 0.49637952559300874)"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compute_confidence_interval(values, ci=0.9)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While the value is *near* 0.50, which would indicate that there is no value to our assumption, we can see that the confidence bound is actually less than 0.50 at the 0.9 confidence level. \n",
    "\n",
    "We could misinterpret this and suddenly start trading on the pattern \"if IBM increases, short MSFT\". Our evidence does support this idea, but proper backtesting would be needed. Note though, the importance of the confidence interval in this decision. By the mean alone, the value was so close to 0.50 that most would just write it off as \"roughly a coin flip, so no further research to be done\". A small amount of coding gets us confidence intervals and \"there may be a slight edge here that is exploitable\".\n",
    "\n",
    "That said, always check the confidence intervals. Remember that a confidence interval of 0.90 roughly equates to \"if we do 10 experiments at a CI of 0.9, one in ten will be wrong\". Here is the same analysis for different ci levels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CI\tFalse positives in...\n",
      "0.50000\t1 in 2\n",
      "0.75000\t1 in 4\n",
      "0.80000\t1 in 5\n",
      "0.85000\t1 in 6\n",
      "0.90000\t1 in 10\n",
      "0.95000\t1 in 19\n",
      "0.97500\t1 in 39\n",
      "0.99000\t1 in 99\n",
      "0.99900\t1 in 999\n",
      "0.99990\t1 in 10000\n",
      "0.99999\t1 in 100000\n"
     ]
    }
   ],
   "source": [
    "print(\"CI\\tFalse positives in...\")\n",
    "for ci in [0.5, 0.75, 0.8, 0.85, 0.9, 0.95, 0.975, 0.99, 0.999, 0.9999, 0.99999]:\n",
    "    number_wrong = int(1/(1-ci))\n",
    "    print(\"{ci:.5f}\\t1 in {number_wrong}\".format(**locals()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Those last few values are colloquially referred to as \"three 9s\", \"four 9s\" and \"five 9s\" and so on, especially in studies of reliability."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Modify the worked example to test this hypothesis:\n",
    "\n",
    "    If IBM drops by more than 5% on a given day, MSFT will increase the following day.\n",
    "    \n",
    "Provide a single estimate for the probability of this happening, as well as a confidence interval.\n",
    "    \n",
    "#### Extended Exercise\n",
    "\n",
    "Modify further to test this hypothesis:\n",
    "\n",
    "    If IBM drops by more than 5% over a given week, MSFT will increase the following week."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
